Linear regression - recap

Now we assume that any change in y is due to a change in x.

Example of linear regression

Effects of starvation and humidity on water loss in the confused flour beetle. Here, looking at the relationship between humidity and weight loss

flr_beetle <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter8/Data/nelson.csv")
flr_beetle
##   HUMIDITY WEIGHTLOSS
## 1      0.0       8.98
## 2     12.0       8.14
## 3     29.5       6.67
## 4     43.0       6.08
## 5     53.0       5.90
## 6     62.5       5.83
## 7     75.5       4.68
## 8     85.0       4.20
## 9     93.0       3.72

Plot these data

library(ggplot2)
ggplot() +
  geom_point(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  theme_bw()

Run a linear regression

flr_beetle_lm <- lm(data = flr_beetle, WEIGHTLOSS ~ HUMIDITY)
summary(flr_beetle_lm)
## 
## Call:
## lm(formula = WEIGHTLOSS ~ HUMIDITY, data = flr_beetle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46397 -0.03437  0.01675  0.07464  0.45236 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.704027   0.191565   45.44 6.54e-10 ***
## HUMIDITY    -0.053222   0.003256  -16.35 7.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2967 on 7 degrees of freedom
## Multiple R-squared:  0.9745, Adjusted R-squared:  0.9708 
## F-statistic: 267.2 on 1 and 7 DF,  p-value: 7.816e-07

Plot these data, with lm fit

ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  geom_point() +
  stat_smooth(method = "lm") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Linear regression assumptions

The big three:

  1. Normality: The \(y_i\) AND error values (\(\epsilon_i\)) are normally distributed. If normality is violated, p-values and confidence intervals may be inaccurate and unreliable.
  2. Homogeneity of variance: The \(y_i\) AND error values (\(\epsilon_i\)) have the same variance for each \(x_i\).
  3. Independence: The \(y_i\) AND error values are independent of each other.

Other assumptions:

  • Linearity: The relationship between \(x_i\) and \(y_i\) is linear (only important when using simple linear regression).
  • Fixed X values: The \(x_i\) values are measured without error. In practice this means the error in the \(x_i\) values is much smaller than that in the \(y_i\) values.

Line fitting and model evalutation

You can access a video going through this section here - https://youtu.be/2tI5YFDajp8

Finding the best-fit line

We have jumped into linear regression by going through the example above relating flour beetle weight loss to humidity, but we have not gone into detail about how “the best fit line” is actually calculated.

We are not going to go into the mathematical details of line fitting here (you can get more details looking at Box 5.5 on pp. 83-84 in Quinn and Keough 2002), but we will go through an intuitive explanation.

Recall, that our goal is to fit a line through the cloud of data points such that the distance between the points and the line is minimized. These distances are what we call residuals (or errors), and are symbolized with \(\epsilon\) or \(e\).

Looking at the plot made with the code below, the vertical black lines represent the residuals.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
flr_beetle$y_hat <- predict(flr_beetle_lm, newdata = flr_beetle)
flr_beetle$y_min <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = min)
flr_beetle$y_max <- apply(select(flr_beetle, WEIGHTLOSS, y_hat), MARGIN = 1, FUN = max)

ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  geom_linerange(aes(ymin = y_min, ymax = y_max )) +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Again, our goal is to fit a line that minimizes these residuals. In a mathematical sense, what we want to do is find estimates of \(\beta_0\) and \(\beta_1\), i.e., our slope and intercept values, to use in the equation of this line, \(\hat{y} = b_0 + b_1 x\), such that we minimize the value \(\sum(y_i - \hat{y})^2\).

This last equation, \(\sum(y_i - \hat{y})^2\), goes by several names, including the sum of squared residuals, sum of squared errors, and sum of squared deviations. For short, I’ll often use \(SS_{res}\).

It is possible to minimize \(SS_{res}\) by taking a guess at \(b_0\) and \(b_1\), calculating \(SS_{res}\), taking another guess at the parameters, and accepting them if \(SS_{res}\) is smaller. Keep doing this until to can minimize \(SS_{res}\) any further. But this is a pretty inefficient approach. Instead, using the probability distribution we derived last week for this line and a few principles of calculus, we can find the values of \(b_0\) and \(b_1\) that minimize \(SS_{res}\). We saw these equations last week.

My model looks good, but is it meaningful?

In order to determine if your linear regression model is meaningful (or significant) you must compare the variance explained by your model versus the variance NOT explained by your model (conveniently referred to as the variance unexplained).

Your residuals represent the variation that is not explained!

Logan - Figure 8.3

Logan - Figure 8.3

Note: there is a typo in this figure in panel (c). Instead of “Explained variability”, the arrow tag should be “Unexplained variability”.

In this figure, we see in panel (d) that we compare the variation explained to the variation unexplained, and use an F-ratio test to see if the amount of variation ascribed to each are different. If the explained variation is relatively large compared to the unexplained variation, then this ratio will be large. Larger \(F\) ratios are more extreme, when considering an F distribution. Thus, an extreme \(F\) ratio would imply that the two variances are not equal, and specifically, that the amount of variance explained is significantly greater than that unexplained.

Standard error of regression coefficients, the regression line, and \(\hat{y}_i\) predictions

Regression coefficients are statistics and thus we can determine the standard error of these statistics. From there, we can use these values and the t-distribution to determine confidence intervals. (Note that we can also see if these coefficients are significantly different from 0 using a Wald Test (similar to the \(t\) test).)

The confidence interval for \(b_1\) is:

\[ b_1 \pm t_{0.05, n-2}s_{b_{1}} \]

\(\beta_1\) standard error

\[ s_{b_{1}} = \sqrt{ \frac{MS_{Residual}}{\sum^n_{i=1}(x_i-\bar{x})^2} } \]

\(\beta_0\) standard error

\[ s_{b_{0}} = \sqrt{ MS_{Residual} [\frac{1}{n} + \frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]

Confidence bands for regression line

From Quinn and Keough, p. 87:

The 95% confidence band is a biconcave band that will contain the true population regression line 95% of the time.

\(\hat{y}_i\) standard error

\[ s_{\hat{y}} = \sqrt{ MS_{Residual} [1 + \frac{1}{n} + \frac{x_p - \bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]

where \(x_p\) is the value from \(x\) we are “predicting” a \(y\) value for.

Linear regression diagnostics

If we plot the lm result, we will see a set of diagnostic plots:

plot(flr_beetle_lm)

  • Leverage: a measure of how extreme a value in x-space is (i.e., along the x-axis) and how much of an influence each \(x_i\) has on \(\hat{y}_i\). High leverage values indicate that model is unduly influenced by an outlying value.

  • Residuals: the differences between the observed \(y_i\) values and the predicted \(\hat{y}_i\) values. Looking at the pattern between residuals and \(\hat{y}_i\) values can yield important information regarding violation of model assumptions (e.g., homogeneity of variance).

  • Cook’s D: a statistics that offers a measure of the influence of each data point on the fitted model that incorporates both leverage and residuals. Values \(\ge 1\) are considered “suspect”.

Regression and regression-like techniques we will not be covering in depth

Below is a list of techniques that are covered well in both of the class textbooks.

Multiple linear regression

What do we do when we have two or more predictors?

See written notes – page 2

Additive model

Only the predictors themselves are considered.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_j x_{ij} + \epsilon_i \]

Multiplicative model

The predictors and the interactions between predictors are considered.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + ... + \beta_j x_{ij} + \epsilon_i \]

where \(\beta_3 x_{i1} x_{i2}\) is the interactive effect of \(X_1\) and \(X_2\) on \(Y\) and it examines the degree to which the effect of one of the predictor variables depends on the levels of the other (Logan p. 209).

NOTE: There are many more coefficients that need to be estimated in the multiplicative model!

Additional assumption(s) to consider

  • (Multi)collinearity - the predictor variables should not be highly correlated.

Example of collinearity

Temperature in June and temperature in July

Example of (potentially) meaningful interaction

Temperature in June and rainfall in June

Assessing collinearity

  • Look at pair-wise correlation values among all predictor variables
  • calculate tolerance, or its inverse Variance Inflation Factor, for each predictor variable. For VIF, we should be cautious if values are greater than 5, and down right concerned if they are greater than 10.

Example of addative multiple linear regression

Loyn (1987) investigated the effects of habitat fragmentation on abundance of forest birds. He considered multiple predictor variables associated with fragmentation or other important environmental conditions.

Get data and examine.

bird_frag <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter9/Data/loyn.csv")
summary(bird_frag)
##      ABUND            AREA            YR.ISOL          DIST       
##  Min.   : 1.50   Min.   :   0.10   Min.   :1890   Min.   :  26.0  
##  1st Qu.:12.40   1st Qu.:   2.00   1st Qu.:1928   1st Qu.:  93.0  
##  Median :21.05   Median :   7.50   Median :1962   Median : 234.0  
##  Mean   :19.51   Mean   :  69.27   Mean   :1950   Mean   : 240.4  
##  3rd Qu.:28.30   3rd Qu.:  29.75   3rd Qu.:1966   3rd Qu.: 333.2  
##  Max.   :39.60   Max.   :1771.00   Max.   :1976   Max.   :1427.0  
##      LDIST            GRAZE            ALT       
##  Min.   :  26.0   Min.   :1.000   Min.   : 60.0  
##  1st Qu.: 158.2   1st Qu.:2.000   1st Qu.:120.0  
##  Median : 338.5   Median :3.000   Median :140.0  
##  Mean   : 733.3   Mean   :2.982   Mean   :146.2  
##  3rd Qu.: 913.8   3rd Qu.:4.000   3rd Qu.:182.5  
##  Max.   :4426.0   Max.   :5.000   Max.   :260.0

Plot these data.

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggpairs(bird_frag)

We can clearly see here that some of our predictor variables are not normally distributed, particularly AREA, and to a lesser extent DIST and LDIST. We will try to transform these data, to see if we can get them to look a bit more normal.

bird_frag_transform <- bird_frag

bird_frag_transform$AREA <- log10(bird_frag$AREA)
bird_frag_transform$DIST <- log10(bird_frag$DIST)
bird_frag_transform$LDIST <- log10(bird_frag$LDIST)

Re-plot.

ggpairs(bird_frag_transform)

We also see here that there are no extremely high correlations among the predictor variables. How we define “extremely high” is relatively arbitrary, but generally anything greater than 0.8.

To be safe, we’ll examine the VIFs.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(lm(data = bird_frag, ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) + GRAZE + ALT))
##  log10(AREA)      YR.ISOL  log10(DIST) log10(LDIST)        GRAZE          ALT 
##     1.911514     1.804769     1.654553     2.009749     2.524814     1.467937

And, we’re good.

Now run the multiple linear regression model.

bird_frag_lm <- lm(data = bird_frag, ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) + GRAZE + ALT)

Diagnostic plots.

plot(bird_frag_lm)

Looks good!

Now look at the model summary.

summary(bird_frag_lm)
## 
## Call:
## lm(formula = ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) + 
##     GRAZE + ALT, data = bird_frag)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6506  -2.9390   0.5289   2.5353  15.2842 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -125.69725   91.69228  -1.371   0.1767    
## log10(AREA)     7.47023    1.46489   5.099 5.49e-06 ***
## YR.ISOL         0.07387    0.04520   1.634   0.1086    
## log10(DIST)    -0.90696    2.67572  -0.339   0.7361    
## log10(LDIST)   -0.64842    2.12270  -0.305   0.7613    
## GRAZE          -1.66774    0.92993  -1.793   0.0791 .  
## ALT             0.01951    0.02396   0.814   0.4195    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.384 on 49 degrees of freedom
## Multiple R-squared:  0.6849, Adjusted R-squared:  0.6464 
## F-statistic: 17.75 on 6 and 49 DF,  p-value: 8.443e-11

What if we wanted to look at each partial regression plot? Use av.plots from the car package.

avPlots(bird_frag_lm, ask = F)

An alternative way to approximately visualize the relationships between each predictor and the response variable is to look at marginal regression plots.

library(reshape)
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
bird_frag_transform_melt <- melt(bird_frag_transform, id.vars = c("ABUND"))

ggplot(data = bird_frag_transform_melt, aes(x = value, y = ABUND)) +
  geom_point() +
  stat_smooth(method = "lm") +
  facet_wrap( ~variable, scales = "free" )
## `geom_smooth()` using formula 'y ~ x'

Example of multiplicative multiple linear regression

Paruelo and Lauenroth (1996) examined the relationships between the abundance of \(C_3\) plants (those that use \(C_3\) photosynthesis) and geographic and climactic variables. Here we are only going to consider the geographic variables.

Get data and examine.

c3_plants <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter9/Data/paruelo.csv")
summary(c3_plants)
##        C3              MAP            MAT             JJAMAP      
##  Min.   :0.0000   Min.   : 117   Min.   : 2.000   Min.   :0.1000  
##  1st Qu.:0.0500   1st Qu.: 345   1st Qu.: 6.900   1st Qu.:0.2000  
##  Median :0.2100   Median : 421   Median : 8.500   Median :0.2900  
##  Mean   :0.2714   Mean   : 482   Mean   : 9.999   Mean   :0.2884  
##  3rd Qu.:0.4700   3rd Qu.: 575   3rd Qu.:12.900   3rd Qu.:0.3600  
##  Max.   :0.8900   Max.   :1011   Max.   :21.200   Max.   :0.5100  
##      DJFMAP            LONG            LAT       
##  Min.   :0.1100   Min.   : 93.2   Min.   :29.00  
##  1st Qu.:0.1500   1st Qu.:101.8   1st Qu.:36.83  
##  Median :0.2000   Median :106.5   Median :40.17  
##  Mean   :0.2275   Mean   :106.4   Mean   :40.10  
##  3rd Qu.:0.3100   3rd Qu.:111.8   3rd Qu.:43.95  
##  Max.   :0.4900   Max.   :119.5   Max.   :52.13

Again, we’re only going to consider the geographic variable - latitude and longitude (LONG and LAT).

Have a look at these data in graphical format.

library(ggplot2)
library(GGally)

ggpairs(c3_plants)

C3 abundance is not normally distributed. Let’s convert using a log10(C3 + 0.1) conversion. The + 0.1 is needed because the log of 0 is negative infinity.

Also, we should center the lat and long data.

c3_plants$LAT <- scale(c3_plants$LAT, scale = FALSE)
c3_plants$LONG <- scale(c3_plants$LONG, scale = FALSE)

Make and look at non-interaction model.

c3_plants_lm_noint <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT)

summary(c3_plants_lm_noint)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50434 -0.20010 -0.02084  0.20813  0.43932 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.545622   0.028395 -19.216  < 2e-16 ***
## LONG        -0.003737   0.004464  -0.837    0.405    
## LAT          0.042430   0.005417   7.833 3.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2426 on 70 degrees of freedom
## Multiple R-squared:  0.4671, Adjusted R-squared:  0.4519 
## F-statistic: 30.68 on 2 and 70 DF,  p-value: 2.704e-10

Make and look at model with interactions. We can create a model with an interaction term in two ways. They yield identical results.

c3_plants_lm1 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT)

c3_plants_lm2 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG*LAT)

Check out the summary results of both.

summary(c3_plants_lm1)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5529416  0.0274679 -20.130  < 2e-16 ***
## LONG        -0.0025787  0.0043182  -0.597   0.5523    
## LAT          0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:LAT     0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
summary(c3_plants_lm2)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5529416  0.0274679 -20.130  < 2e-16 ***
## LONG        -0.0025787  0.0043182  -0.597   0.5523    
## LAT          0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:LAT     0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11

Interpreting the partial regression coefficients

The partial regression coefficients (i.e., partial slopes) are the slopes of specific predictor variables, holding all other predictor variables constant at their mean values.


Look at diagnostic plots for model with interaction term.

plot(c3_plants_lm1)

Model selection

We want a model that contains enough predictor variables to explain the variation observed, but not one that is over fit. Also, it is important that we not lose focus of the biological questions we are asking. Sometimes, it is best to keep certain predictor variables in a model, even if they are not statistically important, if they are essential to answering our question.

Compare the models without and with an interaction term.

Using anova

Compares the reduction in the residual sums of squares for nested models.

anova(c3_plants_lm_noint,c3_plants_lm1)
## Analysis of Variance Table
## 
## Model 1: log10(C3 + 0.1) ~ LONG + LAT
## Model 2: log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
## 1     70 4.1200                             
## 2     69 3.7595  1   0.36043 6.615 0.01227 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using AIC

Akaike Information Criteria - a relative measure of the information content of a model. Smaller values indicate a more parimonious model. Penalizes models with larger number of predictor variables. As a rule of thumb, differences of greater than 2 (i.e., \(\Delta AIC >2\)) are considered meaningful.

AIC(c3_plants_lm1)
## [1] 0.6350526
AIC(c3_plants_lm_noint)
## [1] 5.318109

The effect of the interaction

Let’s look at the effect of the interaction between LAT and LONG by examining the partial regression coefficient for LONG at different values of LAT. Here we are going to look at the mean latitude value, \(\pm\) 1 and 2 standard deviations.

## mean lat - 2SD 
LAT_sd1 <- mean(c3_plants$LAT)-2*sd(c3_plants$LAT)
c3_plants_LONG.lm1<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd1), data=c3_plants)
summary(c3_plants_LONG.lm1)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd1), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.0662239  0.0674922 -15.798  < 2e-16 ***
## LONG                  -0.0264657  0.0098255  -2.694  0.00887 ** 
## c(LAT - LAT_sd1)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd1)  0.0022522  0.0008757   2.572  0.01227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat - 1SD 
LAT_sd2<-mean(c3_plants$LAT) - 1*sd(c3_plants$LAT)
c3_plants_LONG.lm2<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd2), data=c3_plants)
summary(c3_plants_LONG.lm2)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd2), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.8095827  0.0417093 -19.410  < 2e-16 ***
## LONG                  -0.0145222  0.0060025  -2.419   0.0182 *  
## c(LAT - LAT_sd2)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd2)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat + 1SD 
LAT_sd4<-mean(c3_plants$LAT) + 1*sd(c3_plants$LAT)
c3_plants_LONG.lm4<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd4), data=c3_plants)
summary(c3_plants_LONG.lm4)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd4), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.2963004  0.0399955  -7.408 2.41e-10 ***
## LONG                   0.0093647  0.0066628   1.406   0.1643    
## c(LAT - LAT_sd4)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd4)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat + 2SD 
LAT_sd5<-mean(c3_plants$LAT) + 2*sd(c3_plants$LAT)
c3_plants_LONG.lm5<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd5), data=c3_plants)
summary(c3_plants_LONG.lm5)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd5), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.0396593  0.0653846  -0.607   0.5461    
## LONG                   0.0213082  0.0106426   2.002   0.0492 *  
## c(LAT - LAT_sd5)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd5)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11

Note how the partial regression slope for LONG goes from -0.026 to +0.021 and remember this link from two weeks ago: Visualizing Relations in Multiple Regression


Linear Regression with Quadratic Term (Polynomial Regression)

Some times your trend isn’t quite a straight line. One way to deal with this is to add a quadratic term in your regression.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x^2_{i1} + \epsilon_i \]

Example of polynomial regression

As an example, let’s look at an unpublished data set described in Sokal and Rholf (Biometry, 1997). These data show the frequency of the Lap94 allele in populations of Mytilus edulis and the distance from Southport.

blue mussel

blue mussel

Get the data

blue_mussel <- read.csv(file = "https://mlammens.github.io/ENS-623-Research-Stats/data/Logan_Examples/Chapter9/Data/mytilus.csv")
summary(blue_mussel)
##       LAP              DIST      
##  Min.   :0.1100   Min.   : 1.00  
##  1st Qu.:0.1600   1st Qu.:18.50  
##  Median :0.3150   Median :42.00  
##  Mean   :0.3171   Mean   :37.32  
##  3rd Qu.:0.4600   3rd Qu.:54.00  
##  Max.   :0.5450   Max.   :67.00

Note that the predictor variable is a frequency, and is bounded by 0 and 1. It is common (though controversial) to use an ArcSine transformation with frequency data. We’ll use the arcsine-square root transformation here (Logan, p. 69).

blue_mussel$asinLAP <- asin(sqrt(blue_mussel$LAP)) * 180/pi

Let’s visualize the data

ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Begin by building a simple linear regression model

blue_mussel_lm <- lm(data = blue_mussel, formula = asinLAP ~ DIST)
summary(blue_mussel_lm)
## 
## Call:
## lm(formula = asinLAP ~ DIST, data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5594 -3.1252  0.9808  2.7398  6.6314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.08621    2.31643   6.944 4.70e-06 ***
## DIST         0.46731    0.05498   8.500 4.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.431 on 15 degrees of freedom
## Multiple R-squared:  0.8281, Adjusted R-squared:  0.8166 
## F-statistic: 72.24 on 1 and 15 DF,  p-value: 4.052e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Check diagnostics

plot(blue_mussel_lm)

Particularly pay attention to the residuals versus fitted values in the diagnostic plots.

Now, let’s build a polynomial regression model with the addition of a quadratic term

blue_mussel_lm2 <- lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2))
summary(blue_mussel_lm2)
## 
## Call:
## lm(formula = asinLAP ~ DIST + I(DIST^2), data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3994 -2.4435 -0.9135  2.8193  7.0820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.268605   3.131423   6.473 1.47e-05 ***
## DIST         0.091456   0.210714   0.434   0.6709    
## I(DIST^2)    0.005547   0.003017   1.839   0.0873 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.116 on 14 degrees of freedom
## Multiple R-squared:  0.8615, Adjusted R-squared:  0.8417 
## F-statistic: 43.54 on 2 and 14 DF,  p-value: 9.773e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2) )

Is this a better model?

anova(blue_mussel_lm2, blue_mussel_lm)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2)
## Model 2: asinLAP ~ DIST
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     14 237.22                              
## 2     15 294.50 -1   -57.277 3.3803 0.08729 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Marginally.

Let’s look at one more polynomial (cubic).

Challenge

Make the cubic model and test if it is a better fitting model than the linear or quadratic.

blue_mussel_lm3 <- lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3))
summary(blue_mussel_lm3)
## 
## Call:
## lm(formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3), data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1661 -2.1360 -0.3908  1.9016  6.0079 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.2232524  3.4126910   7.684 3.47e-06 ***
## DIST        -0.9440845  0.4220118  -2.237  0.04343 *  
## I(DIST^2)    0.0421452  0.0138001   3.054  0.00923 ** 
## I(DIST^3)   -0.0003502  0.0001299  -2.697  0.01830 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.421 on 13 degrees of freedom
## Multiple R-squared:  0.9112, Adjusted R-squared:  0.8907 
## F-statistic: 44.46 on 3 and 13 DF,  p-value: 4.268e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3) )

And is this model better?

anova(blue_mussel_lm3, blue_mussel_lm2)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2) + I(DIST^3)
## Model 2: asinLAP ~ DIST + I(DIST^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1     13 152.12                             
## 2     14 237.22 -1   -85.108 7.2734 0.0183 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(blue_mussel_lm3, blue_mussel_lm)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2) + I(DIST^3)
## Model 2: asinLAP ~ DIST
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     13 152.12                              
## 2     15 294.50 -2   -142.38 6.0842 0.01365 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about AIC?

AIC(blue_mussel_lm)
## [1] 102.729
AIC(blue_mussel_lm2)
## [1] 101.0522
AIC(blue_mussel_lm3)
## [1] 95.49808

We can also check the case of adding yet another polynomial factor, \(x^4\).

AIC(lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3) + I(DIST^4)))
## [1] 96.11887